Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv('C:/Users/Admin/Desktop/Data1.csv')
df.head()
| | Unnamed: 0 | ID | Salary | DOJ | DOL | Designation | JobCity | Gender | DOB | 10percentage | ... | ComputerScience | MechanicalEngg | ElectricalEngg | TelecomEngg | CivilEngg | conscientiousness | agreeableness | extraversion | nueroticism | openess_to_experience |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train | 203097 | 420000.0 | 01-06-2012 00:00 | present | senior quality engineer | Bangalore | f | 19-02-1990 00:00 | 84.3 | ... | -1 | -1 | -1 | -1 | -1 | 0.9737 | 0.8128 | 0.5269 | 1.35490 | -0.4455 |
| 1 | train | 579905 | 500000.0 | 01-09-2013 00:00 | present | assistant manager | Indore | m | 04-10-1989 00:00 | 85.4 | ... | -1 | -1 | -1 | -1 | -1 | -0.7335 | 0.3789 | 1.2396 | -0.10760 | 0.8637 |
| 2 | train | 810601 | 325000.0 | 01-06-2014 00:00 | present | systems engineer | Chennai | f | 03-08-1992 00:00 | 85.0 | ... | -1 | -1 | -1 | -1 | -1 | 0.2718 | 1.7109 | 0.1637 | -0.86820 | 0.6721 |
| 3 | train | 267447 | 1100000.0 | 01-07-2011 00:00 | present | senior software engineer | Gurgaon | m | 05-12-1989 00:00 | 85.6 | ... | -1 | -1 | -1 | -1 | -1 | 0.0464 | 0.3448 | -0.3440 | -0.40780 | -0.9194 |
| 4 | train | 343523 | 200000.0 | 01-03-2014 00:00 | 01-03-2015 00:00 | get | Manesar | m | 27-02-1991 00:00 | 78.0 | ... | -1 | -1 | -1 | -1 | -1 | -0.8810 | -0.2793 | -1.0697 | 0.09163 | -0.1295 |
5 rows × 39 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3998 entries, 0 to 3997
Data columns (total 39 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Unnamed: 0             3998 non-null   object
 1   ID                     3998 non-null   int64
 2   Salary                 3998 non-null   float64
 3   DOJ                    3998 non-null   object
 4   DOL                    3998 non-null   object
 5   Designation            3998 non-null   object
 6   JobCity                3998 non-null   object
 7   Gender                 3998 non-null   object
 8   DOB                    3998 non-null   object
 9   10percentage           3998 non-null   float64
10   10board                3998 non-null   object
11   12graduation           3998 non-null   int64
12   12percentage           3998 non-null   float64
13   12board                3998 non-null   object
14   CollegeID              3998 non-null   int64
15   CollegeTier            3998 non-null   int64
16   Degree                 3998 non-null   object
17   Specialization         3998 non-null   object
18   collegeGPA             3998 non-null   float64
19   CollegeCityID          3998 non-null   int64
20   CollegeCityTier        3998 non-null   int64
21   CollegeState           3998 non-null   object
22   GraduationYear         3998 non-null   int64
23   English                3998 non-null   int64
24   Logical                3998 non-null   int64
25   Quant                  3998 non-null   int64
26   Domain                 3998 non-null   float64
27   ComputerProgramming    3998 non-null   int64
28   ElectronicsAndSemicon  3998 non-null   int64
29   ComputerScience        3998 non-null   int64
30   MechanicalEngg         3998 non-null   int64
31   ElectricalEngg         3998 non-null   int64
32   TelecomEngg            3998 non-null   int64
33   CivilEngg              3998 non-null   int64
34   conscientiousness      3998 non-null   float64
35   agreeableness          3998 non-null   float64
36   extraversion           3998 non-null   float64
37   nueroticism            3998 non-null   float64
38   openess_to_experience  3998 non-null   float64
dtypes: float64(10), int64(17), object(12)
memory usage: 1.0+ MB
df['Unnamed: 0'].unique()
array(['train'], dtype=object)
Drop the column Unnamed: 0, since it holds the same value in every row, and the column ID, since every value in it is unique.
df.drop(columns=['Unnamed: 0', 'ID'], inplace=True)
df.head()
| | Salary | DOJ | DOL | Designation | JobCity | Gender | DOB | 10percentage | 10board | 12graduation | ... | ComputerScience | MechanicalEngg | ElectricalEngg | TelecomEngg | CivilEngg | conscientiousness | agreeableness | extraversion | nueroticism | openess_to_experience |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 420000.0 | 01-06-2012 00:00 | present | senior quality engineer | Bangalore | f | 19-02-1990 00:00 | 84.3 | board ofsecondary education,ap | 2007 | ... | -1 | -1 | -1 | -1 | -1 | 0.9737 | 0.8128 | 0.5269 | 1.35490 | -0.4455 |
| 1 | 500000.0 | 01-09-2013 00:00 | present | assistant manager | Indore | m | 04-10-1989 00:00 | 85.4 | cbse | 2007 | ... | -1 | -1 | -1 | -1 | -1 | -0.7335 | 0.3789 | 1.2396 | -0.10760 | 0.8637 |
| 2 | 325000.0 | 01-06-2014 00:00 | present | systems engineer | Chennai | f | 03-08-1992 00:00 | 85.0 | cbse | 2010 | ... | -1 | -1 | -1 | -1 | -1 | 0.2718 | 1.7109 | 0.1637 | -0.86820 | 0.6721 |
| 3 | 1100000.0 | 01-07-2011 00:00 | present | senior software engineer | Gurgaon | m | 05-12-1989 00:00 | 85.6 | cbse | 2007 | ... | -1 | -1 | -1 | -1 | -1 | 0.0464 | 0.3448 | -0.3440 | -0.40780 | -0.9194 |
| 4 | 200000.0 | 01-03-2014 00:00 | 01-03-2015 00:00 | get | Manesar | m | 27-02-1991 00:00 | 78.0 | cbse | 2008 | ... | -1 | -1 | -1 | -1 | -1 | -0.8810 | -0.2793 | -1.0697 | 0.09163 | -0.1295 |
5 rows × 37 columns
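The two columns dropped above were spotted by inspection. As a rough sketch, the same check can be automated by counting distinct values per column. The toy frame and the variable names below are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "Unnamed: 0": ["train", "train", "train", "train"],
    "ID": [203097, 579905, 810601, 267447],
    "Gender": ["f", "m", "f", "m"],
    "Salary": [420000.0, 500000.0, 325000.0, 500000.0],
})

n_unique = toy.nunique()

# Columns that hold a single repeated value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns where every row differs behave like identifiers.
# Caution: a continuous measurement can also be all-unique, so review
# this list before dropping anything.
all_unique_cols = n_unique[n_unique == len(toy)].index.tolist()

cleaned = toy.drop(columns=constant_cols + all_unique_cols)
print(constant_cols, all_unique_cols, list(cleaned.columns))
```

On a real dataset the all-unique list is only a set of candidates; whether a column is truly an identifier remains a judgment call.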
df.isnull().sum()
Salary                   0
DOJ                      0
DOL                      0
Designation              0
JobCity                  0
Gender                   0
DOB                      0
10percentage             0
10board                  0
12graduation             0
12percentage             0
12board                  0
CollegeID                0
CollegeTier              0
Degree                   0
Specialization           0
collegeGPA               0
CollegeCityID            0
CollegeCityTier          0
CollegeState             0
GraduationYear           0
English                  0
Logical                  0
Quant                    0
Domain                   0
ComputerProgramming      0
ElectronicsAndSemicon    0
ComputerScience          0
MechanicalEngg           0
ElectricalEngg           0
TelecomEngg              0
CivilEngg                0
conscientiousness        0
agreeableness            0
extraversion             0
nueroticism              0
openess_to_experience    0
dtype: int64
df['DOL'].value_counts()
present 1875
01-04-2015 00:00 573
01-03-2015 00:00 124
01-05-2015 00:00 112
01-01-2015 00:00 99
...
01-03-2005 00:00 1
01-02-2011 00:00 1
01-10-2015 00:00 1
01-10-2010 00:00 1
01-08-2011 00:00 1
Name: DOL, Length: 67, dtype: int64
DOL
# "present" is not a date; substitute the current timestamp (pd.datetime is deprecated)
df['DOL'] = df['DOL'].replace({'present': pd.Timestamp.now()})
# The strings are day-first (DD-MM-YYYY), so parse them with dayfirst=True
df['DOL'] = pd.to_datetime(df['DOL'], dayfirst=True)
df['DOL'].value_counts()
2021-02-05 11:16:31.079185 1875
2015-04-01 00:00:00.000000 573
2015-03-01 00:00:00.000000 124
2015-05-01 00:00:00.000000 112
2015-01-01 00:00:00.000000 99
...
2011-08-01 00:00:00.000000 1
2011-02-01 00:00:00.000000 1
2009-06-01 00:00:00.000000 1
2010-02-01 00:00:00.000000 1
2005-03-01 00:00:00.000000 1
Name: DOL, Length: 67, dtype: int64
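Date strings such as 01-04-2015 are ambiguous: they can be read as 4 January or as 1 April. Every raw DOL value shown above starts with 01, which reads more plausibly as the first day of a month than as January, so the day-first reading is assumed here. A minimal sketch, on three made-up strings in the same layout, of how the `dayfirst` argument of `pd.to_datetime` changes the result:

```python
import pandas as pd

# Made-up sample in the same layout as the DOL column.
sample = pd.Series(["01-04-2015 00:00", "01-03-2015 00:00", "01-10-2010 00:00"])

month_first = pd.to_datetime(sample)               # read as MM-DD-YYYY
day_first = pd.to_datetime(sample, dayfirst=True)  # read as DD-MM-YYYY

print(month_first.dt.month.tolist())  # [1, 1, 1]
print(day_first.dt.month.tolist())    # [4, 3, 10]
```

Checking a few parsed values against the raw strings is a cheap way to confirm which reading was applied.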
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3998 entries, 0 to 3997
Data columns (total 37 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Salary                 3998 non-null   float64
 1   DOJ                    3998 non-null   object
 2   DOL                    3998 non-null   datetime64[ns]
 3   Designation            3998 non-null   object
 4   JobCity                3998 non-null   object
 5   Gender                 3998 non-null   object
 6   DOB                    3998 non-null   object
 7   10percentage           3998 non-null   float64
 8   10board                3998 non-null   object
 9   12graduation           3998 non-null   int64
10   12percentage           3998 non-null   float64
11   12board                3998 non-null   object
12   CollegeID              3998 non-null   int64
13   CollegeTier            3998 non-null   int64
14   Degree                 3998 non-null   object
15   Specialization         3998 non-null   object
16   collegeGPA             3998 non-null   float64
17   CollegeCityID          3998 non-null   int64
18   CollegeCityTier        3998 non-null   int64
19   CollegeState           3998 non-null   object
20   GraduationYear         3998 non-null   int64
21   English                3998 non-null   int64
22   Logical                3998 non-null   int64
23   Quant                  3998 non-null   int64
24   Domain                 3998 non-null   float64
25   ComputerProgramming    3998 non-null   int64
26   ElectronicsAndSemicon  3998 non-null   int64
27   ComputerScience        3998 non-null   int64
28   MechanicalEngg         3998 non-null   int64
29   ElectricalEngg         3998 non-null   int64
30   TelecomEngg            3998 non-null   int64
31   CivilEngg              3998 non-null   int64
32   conscientiousness      3998 non-null   float64
33   agreeableness          3998 non-null   float64
34   extraversion           3998 non-null   float64
35   nueroticism            3998 non-null   float64
36   openess_to_experience  3998 non-null   float64
dtypes: datetime64[ns](1), float64(10), int64(16), object(10)
memory usage: 999.6+ KB
df.isnull().sum()
Salary                   0
DOJ                      0
DOL                      0
Designation              0
JobCity                  0
Gender                   0
DOB                      0
10percentage             0
10board                  0
12graduation             0
12percentage             0
12board                  0
CollegeID                0
CollegeTier              0
Degree                   0
Specialization           0
collegeGPA               0
CollegeCityID            0
CollegeCityTier          0
CollegeState             0
GraduationYear           0
English                  0
Logical                  0
Quant                    0
Domain                   0
ComputerProgramming      0
ElectronicsAndSemicon    0
ComputerScience          0
MechanicalEngg           0
ElectricalEngg           0
TelecomEngg              0
CivilEngg                0
conscientiousness        0
agreeableness            0
extraversion             0
nueroticism              0
openess_to_experience    0
dtype: int64
df.shape
(3998, 37)
df.dtypes
Salary                   float64
DOJ                      object
DOL                      datetime64[ns]
Designation              object
JobCity                  object
Gender                   object
DOB                      object
10percentage             float64
10board                  object
12graduation             int64
12percentage             float64
12board                  object
CollegeID                int64
CollegeTier              int64
Degree                   object
Specialization           object
collegeGPA               float64
CollegeCityID            int64
CollegeCityTier          int64
CollegeState             object
GraduationYear           int64
English                  int64
Logical                  int64
Quant                    int64
Domain                   float64
ComputerProgramming      int64
ElectronicsAndSemicon    int64
ComputerScience          int64
MechanicalEngg           int64
ElectricalEngg           int64
TelecomEngg              int64
CivilEngg                int64
conscientiousness        float64
agreeableness            float64
extraversion             float64
nueroticism              float64
openess_to_experience    float64
dtype: object
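# distplot is deprecated; histplot with a KDE overlay draws the equivalent figure
sns.histplot(df['Salary'], kde=True, stat='density')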
<AxesSubplot:xlabel='Salary', ylabel='Density'>
A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.
Point plots can be more useful than bar plots for focusing comparisons
between different levels of one or more categorical variables. They are
particularly adept at showing interactions: how the relationship between
levels of one categorical variable changes across levels of a second
categorical variable. The lines that join each point from the same hue
level allow interactions to be judged by differences in slope, which is
easier for the eyes than comparing the heights of several groups of points
or bars.
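By default the estimate drawn by a point plot is the group mean, and the error bar is a bootstrapped confidence interval. The sketch below approximates that computation by hand on made-up numbers; the helper `mean_with_ci` is hypothetical, and seaborn's own implementation may differ in details such as the number of resamples.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Gender": ["f", "f", "f", "f", "m", "m", "m", "m"],
    "Salary": [300.0, 320.0, 340.0, 360.0, 400.0, 450.0, 500.0, 650.0],
})

rng = np.random.default_rng(0)

def mean_with_ci(values, n_boot=1000, level=95):
    """Return the mean and a percentile-bootstrap confidence interval."""
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    resamples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    boot_means = resamples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return float(values.mean()), float(low), float(high)

summary = {gender: mean_with_ci(group["Salary"])
           for gender, group in toy.groupby("Gender")}

for gender, (mean, low, high) in summary.items():
    print(f"{gender}: mean={mean:.1f}, 95% CI=({low:.1f}, {high:.1f})")
```

Each printed mean corresponds to the position of one point, and each interval to the length of its error bar.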
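# Pair each degree with its own count; unique() and value_counts() return values in different orders
degree_counts = df['Degree'].value_counts().rename_axis('Degree').reset_index(name='count')
sns.pointplot(x='Degree', y='count', data=degree_counts)
<AxesSubplot:xlabel='Degree', ylabel='count'>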
sns.pointplot(x="Degree", y="12graduation", hue="Gender",data=df)
<AxesSubplot:xlabel='Degree', ylabel='12graduation'>
The Count plot shows the counts of observations in each categorical bin using bars.
A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
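The bar heights in a count plot are simply the category frequencies, so they can be checked with pandas directly. A small sketch on made-up rows (the degree labels here are illustrative, not taken from the dataset); `pd.crosstab` gives the counts that a `hue` split would show:

```python
import pandas as pd

# Hypothetical toy data with made-up category labels.
toy = pd.DataFrame({
    "Degree": ["B.Tech", "B.Tech", "MCA", "B.Tech", "M.Tech", "MCA"],
    "Gender": ["f", "m", "m", "m", "f", "f"],
})

# One bar per category: the height is the number of rows in that category.
counts = toy["Degree"].value_counts()

# With hue="Gender", each category gets one bar per gender.
by_gender = pd.crosstab(toy["Degree"], toy["Gender"])

print(counts)
print(by_gender)
```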
sns.countplot(x='Degree', data=df)
<AxesSubplot:xlabel='Degree', ylabel='count'>
sns.countplot(x='CollegeState', data=df)
plt.xticks(rotation=90)
plt.show()
sns.countplot(x='CollegeCityTier',hue = 'Gender', data=df)
<AxesSubplot:xlabel='CollegeCityTier', ylabel='count'>
sns.barplot(y='Salary',x='Gender', data=df)
<AxesSubplot:xlabel='Gender', ylabel='Salary'>
sns.boxplot(x='Degree',y='12graduation',hue='Gender',data=df)
<AxesSubplot:xlabel='Degree', ylabel='12graduation'>
sns.boxplot(x='Degree',y='CollegeTier', data=df, palette='rainbow')
<AxesSubplot:xlabel='Degree', ylabel='CollegeTier'>
sns.boxplot(x='CollegeState', y='12graduation', data=df)
plt.xticks(rotation=90)
plt.show()
Seaborn's jointplot displays the relationship between two variables (bivariate) together with the 1D profile of each variable (univariate) in the margins. It is a convenience function that wraps the JointGrid class.
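The link between the central panel and the margins can be seen numerically: summing a 2D histogram along one axis gives the 1D histogram of the other variable. This is a sketch of the idea with NumPy on randomly generated data, not of how seaborn draws the figure.

```python
import numpy as np

# Made-up correlated data.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Joint (bivariate) histogram: the centre of a joint plot.
joint, x_edges, y_edges = np.histogram2d(x, y, bins=5)

# Marginal (univariate) histograms: the side panels.
x_marginal, _ = np.histogram(x, bins=x_edges)
y_marginal, _ = np.histogram(y, bins=y_edges)

# Collapsing the joint counts over y recovers the x marginal, and vice versa.
print(np.array_equal(joint.sum(axis=1), x_marginal))
print(np.array_equal(joint.sum(axis=0), y_marginal))
```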
sns.jointplot(x='12graduation',y='CollegeTier',data=df,kind='hex')
<seaborn.axisgrid.JointGrid at 0x6128418>
sns.jointplot(x='12graduation',y='CollegeTier',data=df,kind='reg')
<seaborn.axisgrid.JointGrid at 0xba5db50>
Draw a combination of boxplot and kernel density estimate.
A violin plot plays a role similar to that of a box-and-whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables so that those distributions can be compared.
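The two ingredients of a violin can be computed by hand: quartiles for the box part and a kernel density estimate for the outline. The sketch below uses made-up values and a Gaussian kernel with a Scott's-rule bandwidth; this is an assumption chosen for illustration, and seaborn's exact settings may differ.

```python
import numpy as np

# Made-up sample.
values = np.array([2.0, 2.5, 3.0, 3.2, 3.5, 4.0, 4.1, 5.0, 6.5])

# Box-plot part: the three quartiles.
q1, median, q3 = np.percentile(values, [25, 50, 75])

# KDE part: an average of Gaussian bumps, one centred on each observation.
n = len(values)
bandwidth = values.std(ddof=1) * n ** (-1 / 5)  # Scott's rule for 1D data
grid = np.linspace(values.min() - 4 * bandwidth,
                   values.max() + 4 * bandwidth, 400)
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 over the grid.
area = density.sum() * (grid[1] - grid[0])
print(q1, median, q3, round(area, 3))
```

Mirroring `density` around a vertical axis gives the violin shape; the quartiles are what the inner box marks.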
sns.violinplot(y='Degree',x='Salary', data=df, palette='rainbow')
<AxesSubplot:xlabel='Salary', ylabel='Degree'>
sns.violinplot(x='Degree',y='collegeGPA',hue='Gender',data=df)
<AxesSubplot:xlabel='Degree', ylabel='collegeGPA'>
A scatterplot is perhaps the most common example of visualizing relationships between two variables. Each point shows an observation in the dataset and these observations are represented by dot-like structures. The plot shows the joint distribution of two variables using a cloud of points.
sns.scatterplot(x="10percentage", y="Salary", data = df)
<AxesSubplot:xlabel='10percentage', ylabel='Salary'>
sns.scatterplot(x="10percentage", y="Salary", hue = "Gender", data = df)
<AxesSubplot:xlabel='10percentage', ylabel='Salary'>
A boxen plot is an enhanced box plot for larger datasets.
This style of plot was originally named a "letter value" plot because it shows a large number of quantiles that are defined as "letter values". It is similar to a box plot in plotting a nonparametric representation of a distribution in which all features correspond to actual observations. By plotting more quantiles, it provides more information about the shape of the distribution, particularly in the tails.
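Letter values are quantiles at successively halved tail fractions: the median, then the quartiles, then the eighths, and so on. The helper below, `letter_values`, is a hypothetical sketch of that idea; real letter-value plots choose the depth from the sample size and may use slightly different quantile definitions.

```python
import numpy as np

data = np.arange(101)  # 0, 1, ..., 100: easy to check by eye

def letter_values(values, depth=3):
    """Return (lower, upper) quantile pairs at tail fractions 1/4, 1/8, ..."""
    pairs = []
    for k in range(2, 2 + depth):
        tail = 0.5 ** k  # 1/4, 1/8, 1/16, ...
        lower, upper = np.quantile(values, [tail, 1 - tail])
        pairs.append((float(lower), float(upper)))
    return pairs

median = float(np.median(data))
boxes = letter_values(data)

print(median)  # 50.0
for lower, upper in boxes:
    print(lower, upper)
```

Each pair corresponds to one nested box in the plot, with deeper pairs reaching further into the tails.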
sns.boxenplot(x='Degree',y='Salary',data=df)
plt.xticks(rotation=90)
plt.show()
sns.boxenplot(x='CollegeState',y='Salary',data=df)
plt.xticks(rotation=90)
plt.show()
A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.
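As a sketch of the complementary use described above, a strip plot can be drawn on the same axes as a box plot so that the individual observations sit on top of the summary. The data below are made up, and the colours and marker size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; labels and values are made up.
toy = pd.DataFrame({
    "Degree": ["B.Tech"] * 6 + ["MCA"] * 6,
    "Salary": [300, 320, 350, 400, 420, 900,
               250, 270, 300, 310, 330, 500],
})

fig, ax = plt.subplots()

# Summary first, so the points are drawn on top of it.
sns.boxplot(x="Degree", y="Salary", data=toy, color="lightgray", ax=ax)

# Every observation, with a little horizontal jitter to reduce overlap.
sns.stripplot(x="Degree", y="Salary", data=toy, color="black", size=4, ax=ax)
```

Passing the same `ax` to both calls is what makes them share one figure.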
sns.stripplot(x='Degree',y='Salary',data=df)
plt.xticks(rotation=90)
plt.show()
sns.stripplot(x='Gender',y='12graduation',data=df)
plt.xticks(rotation=90)
plt.show()
sns.stripplot(x='Degree',y='12graduation',data=df)
plt.xticks(rotation=90)
plt.show()
A pairs plot is also known as a scatterplot matrix. A single scatterplot matches one variable's value in each data row against another variable's value from the same row; a pairs plot elaborates on this by showing every variable paired with every other variable.
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0xcf57b38>
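A pairs plot with k numeric variables has k × k panels, so the 26 numeric columns reported by df.info() above would give a grid of roughly 26 × 26. The sketch below, on a made-up frame, shows how the panel count grows and how a subset of columns could be picked first; `pairplot` accepts such a list through its `vars` parameter.

```python
import itertools
import pandas as pd

# Hypothetical toy frame; values are made up.
toy = pd.DataFrame({
    "Salary": [420000.0, 500000.0, 325000.0],
    "10percentage": [84.3, 85.4, 85.0],
    "collegeGPA": [78.0, 70.1, 70.0],
    "Gender": ["f", "m", "f"],
})

# Only numeric columns take part in the grid.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Distinct variable pairs (the off-diagonal panels, counted once each).
pairs = list(itertools.combinations(numeric_cols, 2))

# Full grid size, including the diagonal and both triangles.
n_panels = len(numeric_cols) ** 2

print(numeric_cols, len(pairs), n_panels)
# A smaller grid could then be requested with, for example:
# sns.pairplot(toy, vars=numeric_cols[:2])
```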